In [ ]:
from google.colab import drive
drive.mount("/content/gdrive/")
%cd /content/gdrive/MyDrive/data_science/mile_stone2/
!ls
Mounted at /content/gdrive/
/content/gdrive/MyDrive/data_science/mile_stone2
'Copy of milestone2.ipynb'   milestone2.html   milestone2.ipynb

Stocks trading recommendation strategies project¶

By Tianyi Xu(https://xutianyi01.github.io/)¶

This project analyzes the historical price and earnings data of over 50 tech companies, visualizing trends and patterns. The goal is to find the most important features for building trading recommendation strategies or stock price predictions.

The Data We're Using and Questions¶

I found three datasets related to this project. One is Stock History, covering stock prices and volumes; the other two are Earnings Reports: one from Macrotrends(https://www.macrotrends.net/) and the other from Yahoo Finance(https://finance.yahoo.com/).

Stock History¶

Using a library called yfinance(https://pypi.org/project/yfinance/), I collected the historical data of these companies as TABLE1, including their stock prices and volumes over the years. TABLE1 is the main dataset we focus on; it describes how the stock price moves.

Earnings Reports¶

We’re also looking at earnings data from Macrotrends(TABLE2) and Yahoo Finance(TABLE3) to get an idea of how these companies are doing financially and whether they are growing. From Macrotrends, I’ve got:

  • Date: When the earnings report was released.
  • Revenue (in millions): How much money the company brought in.

and we will think about the question: Is Revenue the most important feature for deciding how the stock moves? What is the relationship between this feature and the others?

From Yahoo Finance, I’ve got the earnings calendar that includes:

  • Earnings Date: the date when the earnings were announced.
  • EPS Estimate: The predicted Earnings Per Share (EPS) - basically, how much profit the company is expected to make for each share.
  • Reported EPS(EPS Actual): The actual EPS.
  • Surprise(%): measures the difference between actual earnings and analyst estimates.

and we will think about the question: Is Reported EPS the most important feature for deciding how the stock moves? Compared with the other features, is it a better determinant of stock movement?
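The Surprise(%) column can be reproduced (approximately) from the other two EPS columns. A minimal sketch, assuming the usual definition of earnings surprise; Yahoo's displayed figures may differ slightly because the underlying EPS values are rounded:

```python
def surprise_pct(eps_estimate: float, reported_eps: float) -> float:
    """Earnings surprise: how far the reported EPS beat (or missed)
    the analyst estimate, as a percentage of the estimate."""
    return round((reported_eps - eps_estimate) / abs(eps_estimate) * 100, 2)

# A beat: reported 2.5 vs. an estimate of 2.0 -> +25.0%
print(surprise_pct(2.0, 2.5))
# A miss: reported EPS below the estimate gives a negative surprise
print(surprise_pct(1.94, 1.88))
```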

Trend Indicators¶

I’m adding in some trend indicators like MACD, RSI, and Bollinger Bands, computed with the pandas_ta(https://github.com/twopirllc/pandas-ta) library, to help us spot patterns and trends in the data more easily. After tidying TABLE1 we can compute these indicators, and we will consider the questions: Which of MACD, RSI, and Bollinger Bands is the most important feature for deciding how to operate trading strategies? What is the relationship between each feature and the others?

How we’re working together¶

Currently, I am working on the project alone. Since there are no collaborators, I plan to use Google Drive and Google Colab as my development platform.

Project Plan¶

Milestone1: 1. get the three datasets for later use 2. tidy each table 3. do an EDA for each of the three tables

Milestone2: 1. construct learning models, for example Contextual Bandits or Deep Reinforcement Learning, to produce trading recommendation strategies 2. do more EDA to extract useful features

Final: 1. test different features 2. compare the final results and draw conclusions

Import all the Necessary Libraries¶

We use yfinance(https://pypi.org/project/yfinance/) to get daily stock price data and pandas_ta(https://github.com/twopirllc/pandas-ta) to compute the indicators used in later analysis.

In [ ]:
!pip install pandas_ta
!pip install python-dateutil
import yfinance as yf
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas_ta as ta
from dateutil import parser
import time
import seaborn as sns
import warnings
import numpy as np
warnings.filterwarnings('ignore')
Collecting pandas_ta
  Downloading pandas_ta-0.3.14b.tar.gz (115 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 115.1/115.1 kB 2.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from pandas_ta) (1.5.3)
Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (2023.3.post1)
Requirement already satisfied: numpy>=1.21.0 in /usr/local/lib/python3.10/dist-packages (from pandas->pandas_ta) (1.23.5)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->pandas_ta) (1.16.0)
Building wheels for collected packages: pandas_ta
  Building wheel for pandas_ta (setup.py) ... done
  Created wheel for pandas_ta: filename=pandas_ta-0.3.14b0-py3-none-any.whl size=218907 sha256=bc5537b435ac1f4ced1898faf4c058abffd7480744234cac2ae62e72eabac7ed
  Stored in directory: /root/.cache/pip/wheels/69/00/ac/f7fa862c34b0e2ef320175100c233377b4c558944f12474cf0
Successfully built pandas_ta
Installing collected packages: pandas_ta
Successfully installed pandas_ta-0.3.14b0
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.10/dist-packages (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil) (1.16.0)

TABLE1: Get the original dataset from Yahoo Finance for one stock, Apple (AAPL)¶

The columns Open, High, Low, Close, and Adj Close describe the stock price. Volume is the number of shares traded during a given day.

In [ ]:
# Load data from Yahoo Finance
ticker = "AAPL"
stock_data = yf.download(ticker, start="2020-01-01", end="2023-01-01")

# Display the data
display(stock_data.head())

print(stock_data.shape)
display(stock_data.dtypes)
[*********************100%%**********************]  1 of 1 completed
Open High Low Close Adj Close Volume
Date
2020-01-02 74.059998 75.150002 73.797501 75.087502 73.249016 135480400
2020-01-03 74.287498 75.144997 74.125000 74.357498 72.536888 146322800
2020-01-06 73.447502 74.989998 73.187500 74.949997 73.114883 118387200
2020-01-07 74.959999 75.224998 74.370003 74.597504 72.771027 108872000
2020-01-08 74.290001 76.110001 74.290001 75.797501 73.941628 132079200
(756, 6)
Open         float64
High         float64
Low          float64
Close        float64
Adj Close    float64
Volume         int64
dtype: object

The dataset above is an example for one stock; now we extend it to 50 stocks.

Specifying the Stock Tickers¶

We have chosen 50 tech company tickers for our analysis.

In [ ]:
tickers = [
    "AAPL", "MSFT", "AMZN", "GOOGL", "META", "TSLA", "NVDA", "PYPL", "INTC", "CSCO",
    "ADBE", "CRM", "ACN", "TXN", "ORCL", "QCOM", "IBM", "AVGO", "AMD", "SAP",
    "SHOP", "NOW", "SQ", "INTU", "FIS", "VMW", "HPE", "HPQ", "MU", "AMAT",
    "LRCX", "KLAC", "ADI", "MCHP", "CDNS", "ANSS", "KEYS", "TEL", "WDAY", "CTSH",
    "DOCU", "SNPS", "FTNT", "GLW", "VRSN", "AKAM", "SWKS", "TTD", "EPAM", "NET"
]

TABLE1 Tidy: Add New Indicators¶

MACD (Moving Average Convergence Divergence): MACD is a trend-following momentum indicator that shows the relationship between two moving averages of a stock’s price.

MACD Signal: The MACD signal line is a 9-day EMA of the MACD. When the MACD crosses above the signal line, it gives a bullish (buy) signal, and when it crosses below, it gives a bearish (sell) signal.

MACD Histogram: The MACD histogram is the difference between the MACD and its signal line. A positive histogram suggests that bulls are in control (buying pressure), while a negative histogram suggests bears have control (selling pressure).

RSI (Relative Strength Index): RSI measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock. Readings above 70 indicate that a stock may be overbought, and readings below 30 indicate that the stock may be oversold.

Bollinger Upper Band: The upper Bollinger Band is two standard deviations above the simple moving average (SMA) of a stock's price. It helps identify when a stock’s price is considered "high."

Bollinger Middle Band: The middle Bollinger Band is the SMA of the stock’s price, often serving as support or resistance levels for the stock price.

Bollinger Lower Band: The lower Bollinger Band is two standard deviations below the SMA of the stock’s price. It helps identify when a stock’s price is considered "low."
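The definitions above can be reproduced with plain pandas. A minimal sketch on synthetic prices (assumptions: a seeded random walk stands in for real closes, and the standard 12/26/9 MACD, 14-day RSI, and 20-day/2σ Bollinger parameters, which are also pandas_ta's defaults):

```python
import numpy as np
import pandas as pd

# Synthetic close prices: a seeded random walk stands in for real data
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())

# MACD: 12-day EMA minus 26-day EMA; signal line: 9-day EMA of the MACD
ema12 = close.ewm(span=12, adjust=False).mean()
ema26 = close.ewm(span=26, adjust=False).mean()
macd = ema12 - ema26
signal = macd.ewm(span=9, adjust=False).mean()
histogram = macd - signal  # positive -> buying pressure, negative -> selling

# RSI (14-day, Wilder's smoothing): 100 - 100 / (1 + avg gain / avg loss)
delta = close.diff()
gain = delta.clip(lower=0).ewm(alpha=1 / 14, adjust=False).mean()
loss = (-delta.clip(upper=0)).ewm(alpha=1 / 14, adjust=False).mean()
rsi = 100 - 100 / (1 + gain / loss)

# Bollinger Bands: 20-day SMA, plus/minus two rolling standard deviations
mid = close.rolling(20).mean()
std = close.rolling(20).std()
upper, lower = mid + 2 * std, mid - 2 * std

print(round(rsi.iloc[-1], 2), round((upper - lower).iloc[-1], 2))
```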

In [ ]:
def fetch_and_enhance_stock_data(tickers):
    all_data = []

    for ticker in tickers:
        try:
            # Fetch historical stock data
            stock_data = yf.download(ticker, start="2010-01-01", end="2023-01-01")

            # Calculate MACD
            macd = ta.macd(stock_data['Close'])
            stock_data['MACD'] = macd.iloc[:, 0]
            stock_data['MACD Signal'] = macd.iloc[:, 1]
            stock_data['MACD Histogram'] = macd.iloc[:, 2]

            # Calculate RSI
            stock_data['RSI'] = ta.rsi(stock_data['Close'])

            # Calculate Bollinger Bands
            bbands = ta.bbands(stock_data['Close'])
            stock_data['Bollinger Lower'] = bbands.iloc[:, 0]
            stock_data['Bollinger Middle'] = bbands.iloc[:, 1]
            stock_data['Bollinger Upper'] = bbands.iloc[:, 2]

            stock_data['Stock'] = ticker
            cols = ['Stock'] + [col for col in stock_data if col != 'Stock']
            stock_data = stock_data[cols]
            # Drop nan
            all_data.append(stock_data.dropna())
        except Exception as e:
            print(f"Could not retrieve data for {ticker}: {str(e)}")
            continue

    # Concatenate all the individual DataFrames into one
    combined_data = pd.concat(all_data, axis=0)

    return combined_data


# Fetch and enhance the stock data
stock_data = fetch_and_enhance_stock_data(tickers)

table1 = stock_data.copy()

display(stock_data.head(10))
print(stock_data.shape)
display(stock_data.dtypes)
[*********************100%%**********************]  1 of 1 completed
Stock Open High Low Close Adj Close Volume MACD MACD Signal MACD Histogram RSI Bollinger Lower Bollinger Middle Bollinger Upper
Date
2010-02-22 AAPL 7.226429 7.232143 7.113929 7.157857 6.075565 390563600 -0.088343 0.066146 -0.154489 47.617442 7.145997 7.221214 7.296432
2010-02-23 AAPL 7.142857 7.190357 6.989643 7.037857 5.973710 575094800 -0.091067 0.050738 -0.141805 43.102673 7.024712 7.175929 7.327145
2010-02-24 AAPL 7.079643 7.194286 7.065714 7.166429 6.082839 460566400 -0.081906 0.047918 -0.129825 48.713452 7.022721 7.162429 7.302136
2010-02-25 AAPL 7.049286 7.245000 7.031786 7.214286 6.123461 665126000 -0.069979 0.047877 -0.117856 50.663657 7.030473 7.155786 7.281099
2010-02-26 AAPL 7.227857 7.327500 7.214286 7.307857 6.202885 507460800 -0.052372 0.052387 -0.104759 54.321085 7.001694 7.176857 7.352020
2010-03-01 AAPL 7.348214 7.482143 7.337500 7.463929 6.335357 550093600 -0.025530 0.063383 -0.088913 59.688942 6.952883 7.238072 7.523261
2010-03-02 AAPL 7.497500 7.529643 7.419286 7.458929 6.331113 566546400 -0.004608 0.067444 -0.072052 59.447917 7.077504 7.322286 7.567068
2010-03-03 AAPL 7.462143 7.495357 7.426429 7.476071 6.345663 372052800 0.013204 0.068205 -0.055001 60.043600 7.174299 7.384214 7.594130
2010-03-04 AAPL 7.474286 7.532857 7.451071 7.525357 6.387496 366041200 0.030940 0.068753 -0.037813 61.781887 7.300079 7.446429 7.592779
2010-03-05 AAPL 7.676429 7.846429 7.665357 7.819643 6.637286 899620400 0.067959 0.084617 -0.016658 70.136225 7.273868 7.548786 7.823704
(146264, 14)
Stock                object
Open                float64
High                float64
Low                 float64
Close               float64
Adj Close           float64
Volume                int64
MACD                float64
MACD Signal         float64
MACD Histogram      float64
RSI                 float64
Bollinger Lower     float64
Bollinger Middle    float64
Bollinger Upper     float64
dtype: object

TABLE2: Function to Fetch Earnings Data from Macrotrends¶

A function to scrape earnings data from Macrotrends(https://www.macrotrends.net/) for each stock. We can use Apple's quarterly revenue to inform long-term investment decisions.

In [ ]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}

def fetch_macrotrends_earnings_data(ticker):
    url = f'https://www.macrotrends.net/stocks/charts/{ticker.lower()}/{ticker}/revenue'
    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        print(f"Failed to retrieve the data for {ticker}")
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find_all('table')[1]
    earnings_data = pd.read_html(str(table))[0]

    return earnings_data

apple_earnings_data = fetch_macrotrends_earnings_data('AAPL')
display(apple_earnings_data.head(10))
print(apple_earnings_data.shape)
display(apple_earnings_data.dtypes)
Apple Quarterly Revenue (Millions of US $) Apple Quarterly Revenue (Millions of US $).1
0 2023-06-30 $81,797
1 2023-03-31 $94,836
2 2022-12-31 $117,154
3 2022-09-30 $90,146
4 2022-06-30 $82,959
5 2022-03-31 $97,278
6 2021-12-31 $123,945
7 2021-09-30 $83,360
8 2021-06-30 $81,434
9 2021-03-31 $89,584
(58, 2)
Apple Quarterly Revenue (Millions of US $)      object
Apple Quarterly Revenue (Millions of US $).1    object
dtype: object

TABLE2 Tidy: Clean data and combine all 50 stocks¶

In [ ]:
def clean_macrotrends_earnings_data(df, ticker):
    df.columns = ['Date', 'Revenue']
    df['Date'] = pd.to_datetime(df['Date'])
    # regex=False: treat '$' and ',' as literal characters, not regex patterns
    df['Revenue'] = df['Revenue'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
    df['Stock'] = ticker
    return df

# Fetch and clean earnings data for all stocks
all_earnings_data = []

for ticker in tickers:
    raw_earnings_data = fetch_macrotrends_earnings_data(ticker)
    if raw_earnings_data is not None:
        cleaned_earnings_data = clean_macrotrends_earnings_data(raw_earnings_data, ticker)
        all_earnings_data.append(cleaned_earnings_data)

# Combine all earnings data into a single DataFrame
combined_earnings_data = pd.concat(all_earnings_data, axis=0)

table2 = combined_earnings_data.copy()
In [ ]:
display(combined_earnings_data.head(10))
print(combined_earnings_data.shape)
display(combined_earnings_data.dtypes)
Date Revenue Stock
0 2023-06-30 81797.0 AAPL
1 2023-03-31 94836.0 AAPL
2 2022-12-31 117154.0 AAPL
3 2022-09-30 90146.0 AAPL
4 2022-06-30 82959.0 AAPL
5 2022-03-31 97278.0 AAPL
6 2021-12-31 123945.0 AAPL
7 2021-09-30 83360.0 AAPL
8 2021-06-30 81434.0 AAPL
9 2021-03-31 89584.0 AAPL
(2707, 3)
Date       datetime64[ns]
Revenue           float64
Stock              object
dtype: object

EDA for TABLE2:¶

We can see that Apple's quarterly revenue is the highest compared with MSFT and GOOGL, and that it increases year by year. We can infer that Apple's stock price should rise year by year, and that its growth rate may be higher than MSFT's and GOOGL's.

In [ ]:
# Display basic statistics
print(combined_earnings_data.describe())

# Plotting the revenue over time for Apple
plt.figure(figsize=(10, 6))
apple_data = combined_earnings_data[combined_earnings_data['Stock'] == 'AAPL']
plt.plot(apple_data['Date'], apple_data['Revenue'], marker='o', linestyle='-', color='b')
plt.title('Apple Quarterly Revenue Over Time')
plt.xlabel('Date')
plt.ylabel('Revenue (in millions)')
plt.grid(True)
plt.show()

# Plotting the revenue distributions for Apple, Microsoft, and Google
selected_stocks = ['AAPL', 'MSFT', 'GOOGL']
selected_data = combined_earnings_data[combined_earnings_data['Stock'].isin(selected_stocks)]

plt.figure(figsize=(10, 6))
sns.boxplot(x='Stock', y='Revenue', data=selected_data)
plt.title('Revenue Distribution for Selected Stocks')
plt.xlabel('Stock')
plt.ylabel('Revenue (in millions)')
plt.show()
             Revenue
count    2679.000000
mean     7341.410228
std     14817.978635
min    -26397.000000
25%       705.000000
50%      2230.000000
75%      7286.500000
max    149204.000000

TABLE3: Fetching Earnings Data from Yahoo Finance¶

A function to scrape earnings data from Yahoo Finance(https://finance.yahoo.com/) for each stock. We can use Reported EPS to inform long-term investment decisions.

In [ ]:
def fetch_yahoo_earnings_data(ticker):
    url = f'https://finance.yahoo.com/calendar/earnings?symbol={ticker}'
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to retrieve the data for {ticker}")
        return None

    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table')
    earnings_data = pd.read_html(str(table))[0]

    return earnings_data

msft_yahoo_earnings_data = fetch_yahoo_earnings_data('MSFT')
display(msft_yahoo_earnings_data.head(10))
print(msft_yahoo_earnings_data.shape)
display(msft_yahoo_earnings_data.dtypes)
Symbol Company Earnings Date EPS Estimate Reported EPS Surprise(%)
0 MSFT Microsoft Corp Oct 22, 2024, 6 AMEDT - - -
1 MSFT Microsoft Corp Jul 23, 2024, 6 AMEDT - - -
2 MSFT Microsoft Corp Apr 23, 2024, 6 AMEDT - - -
3 MSFT Microsoft Corp Jan 22, 2024, 4 PMEST 2.78 - -
4 MSFT Microsoft Corp Jan 22, 2024, 5 AMEST 2.78 - -
5 MSFT Microsoft Corporation Oct 24, 2023, 12 PMEDT 2.65 2.99 +12.7
6 MSFT Microsoft Corporation Jul 25, 2023, 12 PMEDT 2.55 2.69 +5.49
7 MSFT Microsoft Corporation Apr 25, 2023, 12 PMEDT 2.23 2.45 +9.81
8 MSFT Microsoft Corporation Jan 24, 2023, 11 AMEST 2.29 2.32 +1.09
9 MSFT Microsoft Corporation Oct 25, 2022, 12 PMEDT 2.3 2.35 +2.05
(100, 6)
Symbol           object
Company          object
Earnings Date    object
EPS Estimate     object
Reported EPS     object
Surprise(%)      object
dtype: object

TABLE3 Tidy: Clean above table¶

In [ ]:
def parse_date(date_string):
    try:
        return parser.parse(date_string)
    except Exception as e:
        return None



def clean_yahoo_earnings_data(df, ticker):
    cleaned_data = []
    for index, row in df.iterrows():
        try:
            if row['EPS Estimate'] == '-' or row['Reported EPS'] == '-' or row['Surprise(%)'] == '-':
                continue

            date_string = row['Earnings Date']

            earnings_date = parse_date(date_string)
            eps_estimate = float(row['EPS Estimate']) if row['EPS Estimate'] != '-' else None
            reported_eps = float(row['Reported EPS']) if row['Reported EPS'] != '-' else None
            surprise = float(row['Surprise(%)'].replace('%', '')) if row['Surprise(%)'] != '-' else None

            cleaned_data.append([ticker, earnings_date, eps_estimate, reported_eps, surprise])
        except Exception as e:
            continue


    cleaned_df = pd.DataFrame(cleaned_data, columns=['Stock', 'Earnings Date', 'EPS Estimate', 'Reported EPS', 'Surprise(%)'])
    return cleaned_df

TABLE3 Tidy: Combine all 50 stocks¶

In [ ]:
all_yahoo_earnings_data = []

for ticker in tickers:
    raw_yahoo_earnings_data = fetch_yahoo_earnings_data(ticker)

    if raw_yahoo_earnings_data is not None and not raw_yahoo_earnings_data.empty:
        cleaned_yahoo_earnings_data = clean_yahoo_earnings_data(raw_yahoo_earnings_data, ticker)

        if not cleaned_yahoo_earnings_data.empty:
            all_yahoo_earnings_data.append(cleaned_yahoo_earnings_data)
        time.sleep(1)

if all_yahoo_earnings_data:
    combined_yahoo_earnings_data = pd.concat(all_yahoo_earnings_data, ignore_index=True)
else:
    print("No data to display.")
In [ ]:
table3 = combined_yahoo_earnings_data.copy()

display(combined_yahoo_earnings_data.head(10))
print(combined_yahoo_earnings_data.shape)
display(combined_yahoo_earnings_data.dtypes)
Stock Earnings Date EPS Estimate Reported EPS Surprise(%)
0 AAPL 2023-11-02 12:00:00 1.39 1.46 4.89
1 AAPL 2023-08-03 12:00:00 1.19 1.26 5.49
2 AAPL 2023-05-04 12:00:00 1.43 1.52 6.03
3 AAPL 2023-02-02 11:00:00 1.94 1.88 -2.88
4 AAPL 2022-10-27 12:00:00 1.27 1.29 1.55
5 AAPL 2022-07-28 12:00:00 1.16 1.20 3.25
6 AAPL 2022-04-28 12:00:00 1.43 1.52 6.44
7 AAPL 2022-01-27 11:00:00 1.89 2.10 11.17
8 AAPL 2021-10-28 12:00:00 1.24 1.24 0.30
9 AAPL 2021-07-27 12:00:00 1.01 1.30 29.12
(3431, 5)
Stock                    object
Earnings Date    datetime64[ns]
EPS Estimate            float64
Reported EPS            float64
Surprise(%)             float64
dtype: object

EDA for TABLE3:¶

We can see that Reported EPS increases over the years, so we can infer that Apple's stock price should rise year by year. One interesting thing is that although Apple's quarterly revenue is higher than MSFT's and GOOGL's, Apple's Reported EPS is lower than MSFT's. So we plot the real stock movements in the final part to compare Apple and Microsoft.

In [ ]:
print(combined_yahoo_earnings_data.describe())
print(combined_yahoo_earnings_data.isnull().sum())
combined_yahoo_earnings_data.dropna(inplace=True)

# After fetching and cleaning the data, filter it to include only AAPL’s data
aapl_earnings_data = combined_yahoo_earnings_data[combined_yahoo_earnings_data['Stock'] == 'AAPL']

# plotting Reported EPS over time
plt.figure(figsize=(10, 6))
sns.lineplot(data=aapl_earnings_data, x='Earnings Date', y='Reported EPS')
plt.title('AAPL Reported EPS Over Time')
plt.show()

# plotting the Surprise(%) over time
plt.figure(figsize=(10, 6))
sns.lineplot(data=aapl_earnings_data, x='Earnings Date', y='Surprise(%)')
plt.title('AAPL Earnings Surprise(%) Over Time')
plt.show()
       EPS Estimate  Reported EPS  Surprise(%)
count   3431.000000   3431.000000  3431.000000
mean       0.736721      0.787753    13.914261
std        1.073613      1.122662    74.110469
min       -1.580000     -1.910000 -1616.790000
25%        0.150000      0.170000     1.935000
50%        0.430000      0.460000     6.080000
75%        0.935000      1.000000    13.830000
max       10.420000     10.710000  1283.470000
Stock              0
Earnings Date    268
EPS Estimate       0
Reported EPS       0
Surprise(%)        0
dtype: int64
In [ ]:
# Compare AAPL's Reported EPS with other companies
others_earnings_data = combined_yahoo_earnings_data[combined_yahoo_earnings_data['Stock'] == 'MSFT']
google_earnings_data = combined_yahoo_earnings_data[combined_yahoo_earnings_data['Stock'] == 'GOOGL']

plt.figure(figsize=(10, 6))
sns.lineplot(data=aapl_earnings_data, x='Earnings Date', y='Reported EPS', marker="o", label='AAPL')
sns.lineplot(data=others_earnings_data, x='Earnings Date', y='Reported EPS', marker="o", label='MSFT')
sns.lineplot(data=google_earnings_data, x='Earnings Date', y='Reported EPS', marker="o", label='GOOGL')
plt.title('Comparing AAPL Reported EPS with Other Companies')
plt.grid(True)
plt.show()

EDA for TABLE1:¶

We compare the stock price variation between AAPL and MSFT. The interesting thing is that while Apple's stock has grown more, Microsoft's has risen by a larger multiple. So from this analysis, Reported EPS seems more important.

In [ ]:
# Extract stock data with AAPL and MSFT
stock_data = fetch_and_enhance_stock_data(['AAPL', 'MSFT'])
aapl_data = stock_data[stock_data['Stock'] == 'AAPL']
msft_data = stock_data[stock_data['Stock'] == 'MSFT']

# Plotting the data
plt.figure(figsize=(10,6))
sns.lineplot(data=aapl_data, x=aapl_data.index, y='Close', label='AAPL')
sns.lineplot(data=msft_data, x=msft_data.index, y='Close', label='MSFT')

plt.title('Stock Price Variation Over Years between AAPL and MSFT')
plt.ylabel('Stock Price')
plt.xlabel('Date')
plt.grid(True)
plt.legend()
plt.show()
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed

Milestone1 Conclusion¶

We summarize our findings, insights, and observations from the data analysis and visualizations; these can be expanded with further analysis and interpretation. We will focus on Reported EPS, and in the future we will consider more features and use our reinforcement learning model to compare results across different features.

Milestone2¶

Merge three Tables¶

I combine all three datasets (Table1: stock_data, Table2: combined_earnings_data, and Table3: combined_yahoo_earnings_data) and conduct a deep EDA to uncover relationships between important features. This will inform the development of a future stock recommendation model, whether it's based on Reinforcement Learning (RL) or classic prediction models.

First, we'll merge the datasets on the common columns (Date and Stock). Since Table1 and Table3 have a one-to-one relationship on the earnings release date, we can merge them directly. Table2 will be joined on the nearest date, since earnings data is quarterly while stock data is daily.
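The nearest-date join can be seen on a toy example with pd.merge_asof (the dates and revenue figures below are made-up stand-ins for TABLE1/TABLE2 rows, not the real data):

```python
import pandas as pd

# Toy daily stock rows (stand-ins for TABLE1)
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-02-01", "2022-04-01"]),
    "Stock": ["AAPL"] * 3,
    "Close": [182.0, 174.6, 174.3],
})

# Toy quarterly revenue rows (stand-ins for TABLE2)
quarterly = pd.DataFrame({
    "Date": pd.to_datetime(["2021-12-31", "2022-03-31"]),
    "Stock": ["AAPL"] * 2,
    "Revenue": [123945.0, 97278.0],
})

# Each daily row picks the quarterly row whose Date is nearest
aligned = pd.merge_asof(
    daily.sort_values("Date"),
    quarterly.sort_values("Date"),
    by="Stock", on="Date", direction="nearest",
)
print(aligned[["Date", "Close", "Revenue"]])
```

Here 2022-01-03 and 2022-02-01 are both closer to the 2021-12-31 report, while 2022-04-01 aligns with 2022-03-31.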

In [ ]:
table1_final = table1.copy()
table2_final = table2.copy()
table3_final = table3.copy()
In [ ]:
# Reset index for table1_final
table1_final.reset_index(inplace=True)

table1_final['Date'] = pd.to_datetime(table1_final['Date']).dt.date


# Convert 'Earnings Date' to datetime format
table3_final['Earnings Date'] = pd.to_datetime(table3_final['Earnings Date'])

# Extract just the date part from 'Earnings Date'
table3_final['Earnings Date'] = table3_final['Earnings Date'].dt.date

table3_final.rename(columns={'Earnings Date': 'Date'}, inplace=True)

Final Dataset¶

Here, after merging, the 'EPS Estimate', 'Reported EPS', 'Surprise(%)', and 'Revenue' columns will contain NaN values, since EPS is reported quarterly but stock data is daily. We choose to drop the NaN values because, first, we don't know the true variation of EPS between quarters, and second, after dropping them we still have enough data.
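As a toy illustration of why these NaN values appear and what dropna does (the dates and EPS value below are made up, not taken from the merged tables):

```python
import pandas as pd

# Toy daily prices and a single quarterly EPS release
daily = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-27", "2022-07-28", "2022-07-29"]),
    "Close": [156.8, 157.4, 162.5],
})
eps = pd.DataFrame({
    "Date": pd.to_datetime(["2022-07-28"]),
    "Reported EPS": [1.20],
})

# Left merge: daily rows without a same-day earnings release get NaN
merged = daily.merge(eps, on="Date", how="left")
print(merged["Reported EPS"].isna().sum())  # 2 daily rows have no EPS row

# Keep only rows with a same-day earnings release
merged = merged.dropna(subset=["Reported EPS"])
print(len(merged))  # 1
```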

In [ ]:
# Merge table1_final with table3_final on the Date
table1_final['Date'] = pd.to_datetime(table1_final['Date'])
table3_final['Date'] = pd.to_datetime(table3_final['Date'])

merged_data = pd.merge(table1_final, table3_final, how='left', left_on=['Date', 'Stock'], right_on=['Date', 'Stock'])


# Merge merged_data with table2_final on the nearest Date
table2_final['Date'] = pd.to_datetime(table2_final['Date'])
merged_data['Date'] = pd.to_datetime(merged_data['Date'])
merged_data = pd.merge_asof(merged_data.sort_values('Date'), table2_final.sort_values('Date'), by='Stock', on='Date', direction='nearest')


# Move 'Stock' column to the first position
cols = ['Stock'] + [col for col in merged_data.columns if col != 'Stock']
merged_data = merged_data[cols]

# Drop Nan
merged_data.dropna(subset=['EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue'], inplace=True)

merged_data.reset_index(drop=True, inplace=True)

# Display merged dataset
display(merged_data.head())

print(merged_data.shape)
display(merged_data.dtypes)
Stock Date Open High Low Close Adj Close Volume MACD MACD Signal MACD Histogram RSI Bollinger Lower Bollinger Middle Bollinger Upper EPS Estimate Reported EPS Surprise(%) Revenue
0 CRM 2010-02-24 17.080000 17.504999 17.022499 17.360001 17.360001 10460400 -0.075746 0.271555 -0.347302 59.118764 16.990480 17.204500 17.418519 0.04 0.04 6.89 354.0
1 AVGO 2010-02-24 17.500000 17.870001 17.410000 17.740000 12.838190 852300 -0.254461 0.146585 -0.401046 52.805132 17.210859 17.490000 17.769142 0.30 0.34 14.09 456.0
2 ANSS 2010-02-25 43.099998 43.639999 42.029999 43.619999 43.619999 462800 0.160317 0.380058 -0.219741 58.903683 42.637163 43.400000 44.162837 0.49 0.53 8.38 136.0
3 ADBE 2010-03-23 35.259998 35.419998 34.689999 35.220001 35.220001 8504500 0.240266 0.015448 0.224818 55.163065 34.452261 35.216000 35.979740 0.37 0.40 7.34 859.0
4 ACN 2010-03-25 42.099998 42.480000 41.500000 41.520000 32.030968 4643600 0.221055 0.011567 0.209488 48.319988 41.338574 42.040001 42.741428 0.61 0.60 -2.17 5538.0
(1866, 19)
Stock                       object
Date                datetime64[ns]
Open                       float64
High                       float64
Low                        float64
Close                      float64
Adj Close                  float64
Volume                       int64
MACD                       float64
MACD Signal                float64
MACD Histogram             float64
RSI                        float64
Bollinger Lower            float64
Bollinger Middle           float64
Bollinger Upper            float64
EPS Estimate               float64
Reported EPS               float64
Surprise(%)                float64
Revenue                    float64
dtype: object

EDA for final dataset¶

Summary Statistics:¶

Descriptive statistics for features.

In [ ]:
# Summary Statistics
print(merged_data.describe())
print(merged_data['Stock'].value_counts())
              Open         High          Low        Close    Adj Close  \
count  1866.000000  1866.000000  1866.000000  1866.000000  1866.000000   
mean     87.573812    88.862316    86.207862    87.596345    82.460398   
std      96.143209    97.640704    94.534965    96.116943    94.378782   
min       1.534000     1.544667     1.504000     1.522667     1.522667   
25%      27.435000    27.799999    26.800001    27.382751    23.735375   
50%      54.294498    55.224998    53.271999    54.395000    49.689671   
75%     113.280001   114.735001   111.997498   113.344999   104.345003   
max     675.700012   684.830017   663.609985   664.760010   664.760010   

             Volume         MACD  MACD Signal  MACD Histogram          RSI  \
count  1.866000e+03  1866.000000  1866.000000     1866.000000  1866.000000   
mean   3.679789e+07     0.450076     0.027224        0.422851    54.458778   
std    9.468753e+07     3.164273     0.930740        2.976642    11.593190   
min    8.830000e+04   -35.798676   -12.289617      -30.238577    12.441018   
25%    3.269575e+06    -0.179459    -0.102475       -0.196597    46.566632   
50%    9.460000e+06     0.255798     0.025647        0.251278    55.009380   
75%    3.531965e+07     1.034519     0.214732        0.936849    62.957277   
max    1.880998e+09    20.520009     6.946169       22.839408    83.747668   

       Bollinger Lower  Bollinger Middle  Bollinger Upper  EPS Estimate  \
count      1866.000000       1866.000000      1866.000000   1866.000000   
mean         84.660355         87.482621        90.304887      0.905418   
std          93.020314         96.339188        99.732342      1.147128   
min           1.449265          1.541333         1.561554     -0.500000   
25%          26.486310         27.268725        28.109967      0.290000   
50%          52.364663         54.250000        55.958033      0.600000   
75%         109.404774        112.955000       116.540095      1.080000   
max         662.692464        680.662012       698.631559     10.280000   

       Reported EPS  Surprise(%)        Revenue  
count   1866.000000  1866.000000    1866.000000  
mean       0.968617    13.986683    8008.525188  
std        1.194024    74.843757   15706.062694  
min       -0.870000 -1616.790000  -26397.000000  
25%        0.320000     2.012500     709.250000  
50%        0.640000     6.220000    1974.000000  
75%        1.170000    14.057500    7486.750000  
max       10.450000  1283.470000  137412.000000  
ORCL     52
MU       52
CRM      51
QCOM     51
AKAM     51
NVDA     51
AMAT     51
INTC     51
TXN      51
GOOGL    51
MSFT     51
AAPL     51
LRCX     51
HPQ      50
AVGO     50
KLAC     50
IBM      50
ADBE     49
FTNT     49
INTU     49
CDNS     49
SNPS     48
SWKS     48
VMW      48
VRSN     47
AMD      46
MCHP     44
AMZN     44
TSLA     42
META     42
ANSS     40
NOW      39
WDAY     39
CTSH     33
KEYS     32
HPE      28
PYPL     28
SQ       26
ADI      22
GLW      18
ACN      18
FIS      18
TTD      18
DOCU     17
EPAM     12
NET       8
Name: Stock, dtype: int64

Summary Statistics Analysis:¶

Stock Price (Open, High, Low, Close, Adj Close): The count is consistent across these variables, indicating no missing values. The mean, median (50%), and standard deviation (std) give a sense of the central tendency and spread of the data. The mean closing price is about 87.60, while the std is about 96.12, suggesting that stock prices vary widely across different stocks.

Volume: The trading volume also varies greatly (std is much higher than the mean), and there's a huge range (min to max), indicating that some stocks are traded much more frequently than others.

Technical Indicators (MACD, RSI, Bollinger Bands): These are typical indicators used in stock price analysis. The RSI, for example, ranges from 12.44 to 83.74, with a mean of 54.46, suggesting that on average, stocks are neither overbought nor oversold.

EPS and Revenue: There's a lot of variability in EPS (Earnings Per Share) and revenue as well. The negative minimum values for Reported EPS and Revenue indicate losses for some quarters.

Surprise(%): This shows how much the actual earnings differed from the estimates. The huge range (from -1616.79% to 1283.47%) indicates some significant surprises in earnings reports.

Looking at the per-stock counts, some stocks such as 'NET' and 'EPAM' have fewer data points, likely because they were listed on the stock exchange more recently than the others.

Based on the above analysis, the next steps in the EDA process will focus on visualizing and understanding the relationships and patterns in the data.
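As a concrete reference for the technical indicators discussed above, the RSI column can be reproduced from closing prices. Below is a minimal sketch of a 14-period, simple-moving-average RSI; this smoothing choice is an assumption (Wilder's original formula uses exponential smoothing, and the exact settings behind this table are not shown here):

```python
import pandas as pd

def compute_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    # SMA-based RSI variant; Wilder's original uses exponential smoothing
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

prices = pd.Series([44.0, 44.3, 44.1, 44.2, 44.5, 43.9, 44.6, 45.0,
                    45.6, 45.4, 45.2, 46.0, 45.8, 46.2, 46.4])
rsi = compute_rsi(prices)
print(round(rsi.iloc[-1], 2))  # ≈ 73.08 for this mostly-rising toy series
```

Values above roughly 70 are conventionally read as overbought and below 30 as oversold, which is why a mean RSI near 54 suggests a roughly neutral market on average.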

Top 10 Stocks by Variance:¶

The statistics above show that the variance of closing prices is very large. The figure below shows the top 10 stocks by variance.

In [ ]:
import numpy as np

# Calculate variance for each stock
stock_variance = merged_data.groupby('Stock')['Close'].var()

# Sort stocks by variance in descending order and pick top 10
top_10_stocks = stock_variance.sort_values(ascending=False).head(10).index

plt.figure(figsize=(15, 6))

# Plot only for top 10 stocks with highest variance
for stock in top_10_stocks:
    subset = merged_data[merged_data['Stock'] == stock]
    plt.plot(subset['Date'], subset['Close'], label=stock)

plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Stock Prices Over Time for Top 10 Stocks by Variance')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

From the results above, we can see that even prominent companies such as TSLA and META show very large variance. In addition, after 2022 some stocks declined sharply while others, such as AVGO, declined only slightly, which underscores the importance of a stock trading strategy and the value of stock recommendations.

Distribution of Stock Prices¶

Visualize the distribution of stock prices to understand their range and common values.

In [ ]:
plt.figure(figsize=(10, 6))
sns.histplot(data=merged_data, x='Close', kde=True)
plt.title('Distribution of Closing Prices')
plt.show()

The results above show that the distribution of stock prices in our dataset is right-skewed: most stocks cluster at lower prices, with a few outliers at significantly higher prices. The concentration at lower price points may represent more accessible investment opportunities for a broader range of investors, while the higher-priced outliers might suit investors with a larger capital base. We will perform a Volume Analysis to verify these findings.

Volume Traded vs. Stock Price¶

Explore the relationship between the volume traded and the stock price, as high trading volumes can often lead to significant price movements.

In [ ]:
plt.figure(figsize=(10, 6))

sns.regplot(x='Volume', y='Close', data=merged_data, scatter_kws={'alpha':0.3}, lowess=True)

# Setting the x-axis to a logarithmic scale for better visualization
plt.xscale('log')

plt.title('Volume Traded vs. Closing Price')
plt.xlabel('Volume Traded')
plt.ylabel('Closing Price')
plt.show()

The plot shows a concentration of data points towards the lower end of the y-axis (Closing Price) and a wide spread along the x-axis (Volume Traded). This pattern suggests that stocks with lower closing prices span a broader range of trading volumes, while higher-priced stocks show relatively less variation in volume. Based on this analysis, we consider volume and price history to be important features for future recommendations: lower-priced stocks with high volumes tend to trade more frequently.

MACD vs. Stock Price and RSI vs. Stock Price¶

Explore the relationship between MACD or RSI and the stock price, as they are important indicators that often precede significant price movements.

In [ ]:
# Plotting MACD vs. Stock Price
plt.figure(figsize=(14, 6))
plt.subplot(1, 2, 1)
sns.regplot(x='MACD', y='Close', data=merged_data, lowess=True, scatter_kws={'alpha':0.3})
plt.title('MACD vs. Closing Price')
plt.xlabel('MACD')
plt.ylabel('Closing Price')

# Plotting RSI vs. Stock Price
plt.subplot(1, 2, 2)
sns.regplot(x='RSI', y='Close', data=merged_data, lowess=True, scatter_kws={'alpha':0.3})
plt.title('RSI vs. Closing Price')
plt.xlabel('RSI')
plt.ylabel('Closing Price')

plt.tight_layout()
plt.show()

From the above plots, the MACD (Moving Average Convergence Divergence) plot shows an inverted-V pattern centered around the zero line. When MACD crosses above zero, it often signals increasing positive momentum (a bullish trend), and stocks may see rising prices. This information can be crucial for a stock recommendation system.
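For reference, the MACD columns in the table can be derived from closing prices with exponential moving averages. A minimal sketch using the conventional 12/26/9 parameters (whether the notebook's data used exactly these settings is an assumption):

```python
import numpy as np
import pandas as pd

def compute_macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9):
    """MACD = fast EMA - slow EMA; the signal line is an EMA of the MACD."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd = ema_fast - ema_slow
    macd_signal = macd.ewm(span=signal, adjust=False).mean()
    return macd, macd_signal, macd - macd_signal  # last value is the histogram

# On a steadily rising series the fast EMA stays above the slow EMA,
# so MACD ends positive -- the bullish case described above.
rising = pd.Series(np.linspace(100.0, 150.0, 60))
macd, macd_sig, macd_hist = compute_macd(rising)
print(macd.iloc[-1] > 0)  # True
```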

RSI Analysis: The RSI (Relative Strength Index) plot appears as a nearly flat line, suggesting that there isn't a strong, clear relationship between RSI values and stock prices. This indicates that RSI alone may not be a sufficient predictor of stock price movement in this dataset; it may need to be combined with other indicators or used in specific contexts to provide meaningful insights for stock recommendations. In feature engineering, I will therefore compare two approaches, excluding RSI versus combining it with other features, to see the difference.

Reported EPS vs. Stock Price, Surprise(%) vs. Stock Price, Revenue vs. Stock Price¶

Explore the relationship between the Reported EPS or Surprise(%) or Revenue and the stock price.

In [ ]:
# Plotting Reported EPS vs. Stock Price
plt.figure(figsize=(18, 6))
plt.subplot(1, 3, 1)
sns.scatterplot(x='Reported EPS', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Reported EPS', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Reported EPS vs. Closing Price')
plt.xlabel('Reported EPS')
plt.ylabel('Closing Price')

# Plotting Surprise(%) vs. Stock Price
plt.subplot(1, 3, 2)
sns.scatterplot(x='Surprise(%)', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Surprise(%)', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Surprise(%) vs. Closing Price')
plt.xlabel('Surprise(%)')
plt.ylabel('Closing Price')

# Plotting Revenue vs. Stock Price
plt.subplot(1, 3, 3)
sns.scatterplot(x='Revenue', y='Close', data=merged_data, alpha=0.3)
sns.regplot(x='Revenue', y='Close', data=merged_data, lowess=True, scatter=False, color='red')
plt.title('Revenue vs. Closing Price')
plt.xlabel('Revenue')
plt.ylabel('Closing Price')

plt.tight_layout()
plt.show()

From above plots we can know Reported EPS vs. Stock Price: The plot for Reported EPS (Earnings Per Share) versus Stock Price reveals a strong positive relationship. This indicates that stocks with higher earnings per share tend to have higher stock prices. This is consistent with fundamental analysis principles, as EPS is a direct measure of a company's profitability on a per-share basis. A higher EPS typically signals better financial health and thus a potentially more valuable investment.

Although this feature shows the strongest correlation, many points still deviate from the fitted line, which demonstrates that the stock price is not driven by Reported EPS alone; it is necessary to analyze other features as well.

Surprise(%) vs. Stock Price: The relationship between Surprise(%) and Stock Price is gently positive. Surprise(%) measures the difference between actual earnings and analyst estimates. A positive surprise indicates that the company performed better than expected, which can lead to a positive stock price reaction. However, the gentle slope suggests that while earnings surprises can influence stock prices, other factors are also at play, and the impact may not be as pronounced or consistent.

Revenue vs. Stock Price: The plot for Revenue versus Stock Price also shows a gently positive trend. This indicates that companies with higher revenues tend to have higher stock prices, but the relationship is not as strong as with EPS. This makes sense because while revenue is an essential indicator of company size and market presence, it does not account for costs and expenses. Therefore, it's not as strong a profitability indicator as EPS.

Scatter Matrix and Correlation Matrix¶

Explore the relationship between different features and conclude the final feature table.

In [ ]:
features_subset = merged_data[['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'RSI', 'Bollinger Upper', 'Bollinger Lower', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue']
]

# Pair plot
sns.pairplot(features_subset)
plt.show()
In [ ]:
# Selecting relevant financial indicators
financial_indicators = ['Open', 'High', 'Low', 'Close', 'Volume', 'MACD', 'RSI', 'Bollinger Upper', 'Bollinger Lower', 'EPS Estimate', 'Reported EPS', 'Surprise(%)', 'Revenue']

# Creating the correlation matrix
corr_matrix = merged_data[financial_indicators].corr()

# Plotting the correlation matrix
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Financial Indicators')
plt.show()

From the above results (Scatter Matrix and Correlation Matrix), several features (Open, High, Low, Close, Bollinger Upper, and Bollinger Lower) show a nearly perfect linear relationship with a slope of about 1, which implies that they are highly correlated: changes in one feature are mirrored almost exactly by the others.

This means that these features carry very similar information for predictive modeling purposes. If they are perfectly or almost perfectly correlated, I will consider using just one of these features for future predictions to avoid redundancy.

Among these features (Open, High, Low, Close, Bollinger Upper, and Bollinger Lower), I choose to keep only the 'Close' price in my model, since it encapsulates the information provided by the other variables.
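This redundancy check can be made mechanical: drop one feature from each pair whose absolute correlation exceeds a threshold. A minimal sketch on toy data (the 0.95 cutoff and the helper name drop_highly_correlated are illustrative assumptions, not from the notebook):

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return columns to drop: for each pair with |corr| > threshold, keep the first."""
    corr = df.corr().abs()
    # Mask out the lower triangle and diagonal so each pair is counted once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy example: 'High' and 'Low' move almost one-for-one with 'Close'
rng = np.random.default_rng(0)
close = rng.normal(100, 20, 500)
toy = pd.DataFrame({
    'Close': close,
    'High': close + rng.normal(0, 0.5, 500),   # nearly identical to Close
    'Low': close - rng.normal(0, 0.5, 500),
    'Volume': rng.normal(1e6, 2e5, 500),       # unrelated
})
print(drop_highly_correlated(toy))  # ['High', 'Low']
```

Applied to the financial_indicators columns above, the same helper would flag the Open/High/Low/Bollinger group, leaving 'Close' as the representative.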

In addition, based on the previous analysis and the scatter matrix, I select Volume, MACD, Reported EPS, Surprise(%), Revenue, and historical stock prices as the feature table for future prediction, because they are all related to the closing price, with Reported EPS showing the strongest relationship. RSI remains uncertain, so I will compare results with and without it. The correlation matrix also shows that Surprise(%) and Revenue are only weakly related to the closing price, so during feature engineering I will likewise compare results with and without them.
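The with/without-RSI comparison planned above can be scripted as a loop over candidate feature sets. A hedged sketch on synthetic data (in the notebook this would run on merged_data with the real columns; here 'RSI' is deliberately generated as noise just to illustrate the mechanics):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in: Close depends on EPS and MACD; 'RSI' here is pure noise
rng = np.random.default_rng(0)
n = 800
df = pd.DataFrame({
    'Reported EPS': rng.normal(1, 0.5, n),
    'MACD': rng.normal(0, 1, n),
    'RSI': rng.uniform(20, 80, n),
})
df['Close'] = 50 * df['Reported EPS'] + 5 * df['MACD'] + rng.normal(0, 2, n)

feature_sets = {
    'without RSI': ['Reported EPS', 'MACD'],
    'with RSI': ['Reported EPS', 'MACD', 'RSI'],
}
results = {}
for name, cols in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(df[cols], df['Close'],
                                              test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    results[name] = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
print(results)  # RMSE per feature set
```

The same dictionary-of-feature-sets pattern extends directly to the with/without Surprise(%) and Revenue comparisons mentioned above.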

Models¶

Based on my previous analyses and feature selection, I can frame several distinct model questions that leverage different modeling approaches:

Model 1: Predicting Stock Price Movements with Supervised Learning¶

I plan to utilize historical stock data and financial indicators to predict future closing prices of stocks. The independent variables will include 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue', and historical stock prices; I will experiment with including and excluding 'RSI' to assess its impact on predictive performance. The dependent variable will be the stock's 'Close' price in the next quarter. I will employ a regression-based supervised learning model, such as a Random Forest Regressor, to capture the non-linear relationships observed in our exploratory data analysis, which revealed strong correlations between these features and the closing price. I will split the data into training and testing sets to evaluate the model's performance and ensure it generalizes well to unseen data.

Model 2: Optimizing Stock Trading Decisions with Contextual Bandits¶

I plan to deploy a Contextual Bandits model to optimize trading decisions in the stock market. The state, actions, and reward are defined as follows: State (Context): the stock features 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue'.

Actions: The actions are "buy", "sell", or "hold".

Reward: The reward function is the immediate profit or loss from an action. For example, if the agent decides to buy, and the stock price increases in the next time step, the reward is positive. Conversely, if the price decreases, the reward is negative.

Our EDA supports this approach, as it revealed strong correlations between these features and the closing price. I will split the data into training and testing sets, train the model on the training data, and evaluate its performance on the test data.

Another model worth considering: Optimizing Stock Trading Decisions with Deep Reinforcement Learning (DRL)¶

A Deep Reinforcement Learning (DRL) model may be a better choice. The state space will consist of features such as 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue', alongside the agent's portfolio state. The action space will include the potential trading actions: buy, sell, or hold. The reward function will be designed to maximize portfolio value over time, encouraging the agent to learn profitable trading strategies. The EDA conducted has demonstrated the potential of these features to inform stock price movements, suggesting that a DRL agent could leverage these relationships to make informed trading decisions. I will split the data into training and testing sets, train the model on historical training data, and evaluate its performance on the test data.

These models will enable us to address different aspects of stock market prediction and trading optimization. The supervised learning model will focus on predicting future prices, while the RL model will aim at making profitable trading decisions.

Model1¶

Random Forest Regressor¶

The following is the preliminary code for the Random Forest Regressor.

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

features = ['Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
target = 'Close'

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(merged_data[features], merged_data[target], test_size=0.2, random_state=42)

# Initialize the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

# Fit the model
rf_regressor.fit(X_train, y_train)

# Make predictions
predictions = rf_regressor.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)

print(f"Root Mean Squared Error: {rmse}")
Root Mean Squared Error: 39.561926741711375
In [ ]:
# Actual vs Predicted plot
plt.figure(figsize=(12,6))
plt.scatter(y_test, predictions, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=4)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted Close Prices')
plt.show()

# Plot of residuals
residuals = y_test - predictions
plt.figure(figsize=(12,6))
plt.scatter(predictions, residuals, alpha=0.5)
plt.title('Residuals vs Predicted')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.axhline(y=0, color='red', linestyle='--')
plt.show()

The above are the preliminary results of the Random Forest Regressor. In the figure Actual vs. Predicted Close Prices, most points lie close to the black line, but some are still far from it. In the future, we will do more feature engineering to improve the performance.
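One inexpensive piece of that future feature engineering is inspecting the fitted forest's feature_importances_. A minimal sketch on synthetic data (with the notebook's model you would instead build the Series from the fitted rf_regressor and the features list above):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: target driven almost entirely by 'Reported EPS'
rng = np.random.default_rng(1)
n = 600
X = pd.DataFrame({
    'Volume': rng.normal(1e7, 2e6, n),
    'MACD': rng.normal(0, 1, n),
    'Reported EPS': rng.normal(1, 0.5, n),
})
y = 80 * X['Reported EPS'] + rng.normal(0, 3, n)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))  # dominant feature ranks first
```

Importances sum to 1, so low-ranked features are natural candidates to drop or transform in the next iteration.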

Model2¶

Contextual Bandits: LinUCB¶

A contextual bandit is a reinforcement learning approach. We build a contextual bandit model to optimize trading decisions in the stock market. The state, actions, and reward are defined as follows: State (Context): the stock features 'Close' price, 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', and 'Revenue'.

Actions (Arms): The actions are "buy", "sell", or "hold".

Reward: we add a position constraint: position tracks the current stock holdings (positive when stocks are held, negative after they have been sold), and trade_size is the number of stock units to buy or sell in each action (default 1). The reward is the calculated profit or loss, returned along with the updated position.

The following is the preliminary code for Contextual Bandits.

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


# Normalize the features
scaler = StandardScaler()
feature_columns = ['Close', 'Volume', 'MACD', 'Reported EPS', 'Surprise(%)', 'Revenue']
merged_data[feature_columns] = scaler.fit_transform(merged_data[feature_columns])

# Split the dataset into train and test sets
train_data, test_data = train_test_split(merged_data, test_size=0.2, random_state=42)
In [ ]:
class LinUCB:
    def __init__(self, alpha, n_arms, n_features):
        self.alpha = alpha
        self.n_arms = n_arms
        self.n_features = n_features
        self.A = [np.identity(n_features) for _ in range(n_arms)]
        self.b = [np.zeros(n_features) for _ in range(n_arms)]

    def recommend_arm(self, x):
        p = np.zeros(self.n_arms)
        for arm in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[arm])
            theta = A_inv @ self.b[arm]
            p[arm] = theta.T @ x + self.alpha * np.sqrt(x.T @ A_inv @ x)
        return np.argmax(p)

    def update(self, chosen_arm, x, reward):
        # Use the outer product: for a 1-D feature vector, x @ x.T collapses to a scalar
        self.A[chosen_arm] += np.outer(x, x)
        self.b[chosen_arm] += reward * x



def get_reward(chosen_arm, current_price, next_price, positions, trade_size=1):
    # Calculate the reward based on the action taken
    if chosen_arm == 1:  # buy
        if positions <= 0:
            profit_loss = (next_price - current_price) * trade_size
            positions += trade_size  # Update the position to reflect the purchase
        else:
            profit_loss = 0
    elif chosen_arm == 2:  # sell
        if positions >= 0:
            profit_loss = (current_price - next_price) * trade_size
            positions -= trade_size  # Update the position to reflect the sale
        else:
            profit_loss = 0
    else:  # hold
        profit_loss = 0  # No profit or loss if holding

    return profit_loss, positions
In [ ]:
# Initialize positions and bandit algorithm
alpha = 1  # controls exploration-exploitation trade-off
n_arms = 3  # Number of arms: buy, sell, hold
n_features = len(feature_columns)  # Number of features
bandit = LinUCB(alpha, n_arms, n_features)

# Training loop for the bandit on the training set
train_rewards = []
positions = 0

# Note: train_test_split shuffles the rows, so the original DataFrame labels are no
# longer positional; iterate by position with .iloc to pair each row with the next.
for i in range(len(train_data) - 1):
    # Extract the current and next price along with other features
    row = train_data.iloc[i]
    current_features = np.array(row[feature_columns].values, dtype=np.float64)
    current_price = row['Close']
    next_price = train_data.iloc[i + 1]['Close']

    chosen_arm = bandit.recommend_arm(current_features)

    reward, positions = get_reward(chosen_arm, current_price, next_price, positions)

    # Update the bandit
    bandit.update(chosen_arm, current_features, reward)

    # Keep track of the rewards
    train_rewards.append(reward)

# Evaluating on the test set
test_rewards = []
positions = 0

for i in range(len(test_data)):
    if i + 1 < len(test_data):
        # Extract the current and next price along with other features (by position)
        row = test_data.iloc[i]
        current_features = np.array(row[feature_columns].values, dtype=np.float64)
        current_price = row['Close']
        next_price = test_data.iloc[i + 1]['Close']

        chosen_arm = bandit.recommend_arm(current_features)

        reward, positions = get_reward(chosen_arm, current_price, next_price, positions)

        test_rewards.append(reward)
    else:
        # We cannot compute the reward for the last entry in the test set
        test_rewards.append(0)

# Compute cumulative rewards for plotting
cumulative_train_rewards = np.cumsum(train_rewards)
cumulative_test_rewards = np.cumsum(test_rewards)
In [ ]:
# Plot the cumulative rewards for the training set
plt.figure(figsize=(14, 7))
plt.plot(cumulative_train_rewards, label='Train Cumulative Reward')
plt.xlabel('Time step')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward Over Training')
plt.legend()
plt.show()

# Plot the cumulative rewards for the test set
plt.figure(figsize=(14, 7))
plt.plot(cumulative_test_rewards, label='Test Cumulative Reward', color='orange')
plt.xlabel('Time step')
plt.ylabel('Cumulative Reward')
plt.title('Cumulative Reward Over Testing')
plt.legend()
plt.show()

The above are the preliminary results of the Contextual Bandits model. The cumulative reward over both training and testing gradually increases, which suggests the model is learning. In the future, we will do more feature engineering to improve the performance.

In [ ]:
!pip install --upgrade nbconvert
Successfully installed mistune-3.0.2 nbconvert-7.11.0
In [ ]:
%%shell
jupyter nbconvert --to html /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.ipynb
[NbConvertApp] Converting notebook /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.ipynb to html
[NbConvertApp] WARNING | Alternative text is missing on 11 image(s).
[NbConvertApp] Writing 1777819 bytes to /content/gdrive/MyDrive/data_science/mile_stone2/milestone2.html
Out[ ]: